Overview

Dataset statistics

Number of variables: 9
Number of observations: 5490
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 386.1 KiB
Average record size in memory: 72.0 B

Variable types

Numeric: 8
Categorical: 1

Alerts

Mực nước KG is highly overall correlated with Mực nước LT and 3 other fields (High correlation)
Mực nước LT is highly overall correlated with Month and 2 other fields (High correlation)
Mực nước DH is highly overall correlated with Mực nước KG and 1 other field (High correlation)
Lượng mưa KG is highly overall correlated with Mực nước KG and 2 other fields (High correlation)
Lượng mưa LT is highly overall correlated with Mực nước KG and 2 other fields (High correlation)
Lượng mưa DH is highly overall correlated with Lượng mưa KG and 1 other field (High correlation)
Month is highly overall correlated with Mực nước LT (High correlation)
Mực nước DH has 67 (1.2%) zeros (Zeros)
Lượng mưa KG has 2489 (45.3%) zeros (Zeros)
Lượng mưa LT has 2548 (46.4%) zeros (Zeros)
Lượng mưa DH has 2493 (45.4%) zeros (Zeros)

Reproduction

Analysis started: 2022-12-03 15:18:51.330426
Analysis finished: 2022-12-03 15:19:03.213519
Duration: 11.88 seconds
Software version: pandas-profiling v3.5.0
Download configuration: config.json

Variables

Year
Real number (ℝ)

Distinct: 45
Distinct (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1998
Minimum: 1976
Maximum: 2020
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: 1976
5-th percentile: 1978
Q1: 1987
median: 1998
Q3: 2009
95-th percentile: 2018
Maximum: 2020
Range: 44
Interquartile range (IQR): 22

Descriptive statistics

Standard deviation: 12.988356
Coefficient of variation (CV): 0.0065006787
Kurtosis: -1.2011868
Mean: 1998
Median Absolute Deviation (MAD): 11
Skewness: 0
Sum: 10969020
Variance: 168.69739
Monotonicity: Increasing

Histogram with fixed size bins (bins=45)

Common values

Value              Count  Frequency (%)
1976               122    2.2%
1999               122    2.2%
2001               122    2.2%
2002               122    2.2%
2003               122    2.2%
2004               122    2.2%
2005               122    2.2%
2006               122    2.2%
2007               122    2.2%
2008               122    2.2%
Other values (35)  4270   77.8%

Minimum 10 values

Value  Count  Frequency (%)
1976   122    2.2%
1977   122    2.2%
1978   122    2.2%
1979   122    2.2%
1980   122    2.2%
1981   122    2.2%
1982   122    2.2%
1983   122    2.2%
1984   122    2.2%
1985   122    2.2%

Maximum 10 values

Value  Count  Frequency (%)
2020   122    2.2%
2019   122    2.2%
2018   122    2.2%
2017   122    2.2%
2016   122    2.2%
2015   122    2.2%
2014   122    2.2%
2013   122    2.2%
2012   122    2.2%
2011   122    2.2%

Month
Categorical

Distinct: 4
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 43.0 KiB

Value  Count
10     1395
12     1395
9      1350
11     1350

Length

Max length: 2
Median length: 2
Mean length: 1.7540984
Min length: 1

Characters and Unicode

Total characters: 9630
Distinct characters: 4
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 9
2nd row: 9
3rd row: 9
4th row: 9
5th row: 9

Common Values

Value  Count  Frequency (%)
10     1395   25.4%
12     1395   25.4%
9      1350   24.6%
11     1350   24.6%

Length

Histogram of lengths of the category

Common Values (Plot)

Value  Count  Frequency (%)
10     1395   25.4%
12     1395   25.4%
9      1350   24.6%
11     1350   24.6%

Most occurring characters

Value  Count  Frequency (%)
1      5490   57.0%
0      1395   14.5%
2      1395   14.5%
9      1350   14.0%

Most occurring categories

Value           Count  Frequency (%)
Decimal Number  9630   100.0%

Most frequent character per category

Decimal Number
Value  Count  Frequency (%)
1      5490   57.0%
0      1395   14.5%
2      1395   14.5%
9      1350   14.0%

Most occurring scripts

Value   Count  Frequency (%)
Common  9630   100.0%

Most frequent character per script

Common
Value  Count  Frequency (%)
1      5490   57.0%
0      1395   14.5%
2      1395   14.5%
9      1350   14.0%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  9630   100.0%

Most frequent character per block

ASCII
Value  Count  Frequency (%)
1      5490   57.0%
0      1395   14.5%
2      1395   14.5%
9      1350   14.0%

Day
Real number (ℝ)

Distinct: 31
Distinct (%): 0.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 15.754098
Minimum: 1
Maximum: 31
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: 1
5-th percentile: 2
Q1: 8
median: 16
Q3: 23
95-th percentile: 29
Maximum: 31
Range: 30
Interquartile range (IQR): 15

Descriptive statistics

Standard deviation: 8.8077587
Coefficient of variation (CV): 0.5590773
Kurtosis: -1.1987167
Mean: 15.754098
Median Absolute Deviation (MAD): 8
Skewness: 0.0027898929
Sum: 86490
Variance: 77.576614
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=31)

Common values

Value              Count  Frequency (%)
1                  180    3.3%
17                 180    3.3%
30                 180    3.3%
29                 180    3.3%
28                 180    3.3%
27                 180    3.3%
26                 180    3.3%
25                 180    3.3%
24                 180    3.3%
23                 180    3.3%
Other values (21)  3690   67.2%

Minimum 10 values

Value  Count  Frequency (%)
1      180    3.3%
2      180    3.3%
3      180    3.3%
4      180    3.3%
5      180    3.3%
6      180    3.3%
7      180    3.3%
8      180    3.3%
9      180    3.3%
10     180    3.3%

Maximum 10 values

Value  Count  Frequency (%)
31     90     1.6%
30     180    3.3%
29     180    3.3%
28     180    3.3%
27     180    3.3%
26     180    3.3%
25     180    3.3%
24     180    3.3%
23     180    3.3%
22     180    3.3%

Mực nước KG
Real number (ℝ)

Distinct: 428
Distinct (%): 7.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.4690546
Minimum: 5.49
Maximum: 12.22
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: 5.49
5-th percentile: 5.71
Q1: 5.98
median: 6.24
Q3: 6.69
95-th percentile: 8.08
Maximum: 12.22
Range: 6.73
Interquartile range (IQR): 0.71

Descriptive statistics

Standard deviation: 0.78896197
Coefficient of variation (CV): 0.12195939
Kurtosis: 7.4953786
Mean: 6.4690546
Median Absolute Deviation (MAD): 0.32
Skewness: 2.3271686
Sum: 35515.11
Variance: 0.62246099
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value               Count  Frequency (%)
6.13                70     1.3%
6.09                63     1.1%
6.07                63     1.1%
6.18                62     1.1%
6.15                61     1.1%
6.01                60     1.1%
5.85                60     1.1%
6.19                59     1.1%
5.95                59     1.1%
6.05                58     1.1%
Other values (418)  4875   88.8%

Minimum 10 values

Value  Count  Frequency (%)
5.49   2      < 0.1%
5.5    3      0.1%
5.51   5      0.1%
5.52   1      < 0.1%
5.53   3      0.1%
5.54   1      < 0.1%
5.55   3      0.1%
5.56   1      < 0.1%
5.57   3      0.1%
5.58   3      0.1%

Maximum 10 values

Value  Count  Frequency (%)
12.22  1      < 0.1%
11.99  1      < 0.1%
11.79  1      < 0.1%
11.67  1      < 0.1%
11.57  1      < 0.1%
11.4   1      < 0.1%
11.3   1      < 0.1%
11.15  1      < 0.1%
11.12  1      < 0.1%
11.04  1      < 0.1%

Mực nước LT
Real number (ℝ)

Distinct: 336
Distinct (%): 6.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.58364324
Minimum: -0.5
Maximum: 4.62
Zeros: 28
Zeros (%): 0.5%
Negative: 646
Negative (%): 11.8%
Memory size: 43.0 KiB

Quantile statistics

Minimum: -0.5
5-th percentile: -0.17
Q1: 0.22
median: 0.48
Q3: 0.82
95-th percentile: 1.72
Maximum: 4.62
Range: 5.12
Interquartile range (IQR): 0.6

Descriptive statistics

Standard deviation: 0.58706116
Coefficient of variation (CV): 1.0058562
Kurtosis: 3.6099695
Mean: 0.58364324
Median Absolute Deviation (MAD): 0.29
Skewness: 1.4338967
Sum: 3204.2014
Variance: 0.3446408
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value               Count  Frequency (%)
0.3                 76     1.4%
0.34                74     1.3%
0.38                72     1.3%
0.42                70     1.3%
0.28                70     1.3%
0.46                69     1.3%
0.6                 67     1.2%
0.5                 67     1.2%
0.4                 67     1.2%
0.44                64     1.2%
Other values (326)  4794   87.3%

Minimum 10 values

Value  Count  Frequency (%)
-0.5   49     0.9%
-0.48  2      < 0.1%
-0.46  1      < 0.1%
-0.45  4      0.1%
-0.44  5      0.1%
-0.43  1      < 0.1%
-0.42  2      < 0.1%
-0.4   1      < 0.1%
-0.39  3      0.1%
-0.38  3      0.1%

Maximum 10 values

Value  Count  Frequency (%)
4.62   1      < 0.1%
4.2    1      < 0.1%
3.95   1      < 0.1%
3.74   1      < 0.1%
3.72   1      < 0.1%
3.61   2      < 0.1%
3.6    1      < 0.1%
3.53   1      < 0.1%
3.52   1      < 0.1%
3.39   1      < 0.1%

Mực nước DH
Real number (ℝ)

HIGH CORRELATION
ZEROS

Distinct: 167
Distinct (%): 3.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.28808743
Minimum: -0.24
Maximum: 1.99
Zeros: 67
Zeros (%): 1.2%
Negative: 386
Negative (%): 7.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: -0.24
5-th percentile: -0.03
Q1: 0.13
median: 0.26
Q3: 0.4
95-th percentile: 0.7
Maximum: 1.99
Range: 2.23
Interquartile range (IQR): 0.27

Descriptive statistics

Standard deviation: 0.24131484
Coefficient of variation (CV): 0.83764445
Kurtosis: 4.563513
Mean: 0.28808743
Median Absolute Deviation (MAD): 0.13
Skewness: 1.4425409
Sum: 1581.6
Variance: 0.058232851
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value               Count  Frequency (%)
0.16                133    2.4%
0.18                133    2.4%
0.3                 130    2.4%
0.2                 122    2.2%
0.21                119    2.2%
0.24                117    2.1%
0.22                114    2.1%
0.25                114    2.1%
0.14                112    2.0%
0.28                110    2.0%
Other values (157)  4286   78.1%

Minimum 10 values

Value  Count  Frequency (%)
-0.24  1      < 0.1%
-0.22  2      < 0.1%
-0.21  2      < 0.1%
-0.2   2      < 0.1%
-0.19  6      0.1%
-0.18  1      < 0.1%
-0.17  7      0.1%
-0.16  8      0.1%
-0.15  6      0.1%
-0.14  7      0.1%

Maximum 10 values

Value  Count  Frequency (%)
1.99   1      < 0.1%
1.96   1      < 0.1%
1.76   1      < 0.1%
1.75   1      < 0.1%
1.73   1      < 0.1%
1.68   2      < 0.1%
1.66   1      < 0.1%
1.63   1      < 0.1%
1.58   2      < 0.1%
1.54   3      0.1%

Lượng mưa KG
Real number (ℝ)

HIGH CORRELATION
ZEROS

Distinct: 785
Distinct (%): 14.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 13.912805
Minimum: 0
Maximum: 500
Zeros: 2489
Zeros (%): 45.3%
Negative: 0
Negative (%): 0.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0.6
Q3: 10.775
95-th percentile: 74.82
Maximum: 500
Range: 500
Interquartile range (IQR): 10.775

Descriptive statistics

Standard deviation: 35.095403
Coefficient of variation (CV): 2.5225253
Kurtosis: 31.040805
Mean: 13.912805
Median Absolute Deviation (MAD): 0.6
Skewness: 4.783516
Sum: 76381.3
Variance: 1231.6873
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value               Count  Frequency (%)
0                   2489   45.3%
0.2                 89     1.6%
1                   56     1.0%
0.4                 52     0.9%
1.2                 47     0.9%
0.3                 46     0.8%
2                   43     0.8%
0.8                 41     0.7%
4                   39     0.7%
0.1                 36     0.7%
Other values (775)  2552   46.5%

Minimum 10 values

Value  Count  Frequency (%)
0      2489   45.3%
0.1    36     0.7%
0.2    89     1.6%
0.3    46     0.8%
0.4    52     0.9%
0.5    31     0.6%
0.6    33     0.6%
0.7    34     0.6%
0.8    41     0.7%
0.9    25     0.5%

Maximum 10 values

Value  Count  Frequency (%)
500    1      < 0.1%
396.8  1      < 0.1%
378.2  1      < 0.1%
320.9  1      < 0.1%
315.9  1      < 0.1%
314.9  1      < 0.1%
305.9  1      < 0.1%
300.7  1      < 0.1%
299.6  1      < 0.1%
297.4  1      < 0.1%

Lượng mưa LT
Real number (ℝ)

HIGH CORRELATION
ZEROS

Distinct: 773
Distinct (%): 14.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 13.909126
Minimum: 0
Maximum: 686.6
Zeros: 2548
Zeros (%): 46.4%
Negative: 0
Negative (%): 0.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0.4
Q3: 10.375
95-th percentile: 73.455
Maximum: 686.6
Range: 686.6
Interquartile range (IQR): 10.375

Descriptive statistics

Standard deviation: 36.717592
Coefficient of variation (CV): 2.6398203
Kurtosis: 48.796685
Mean: 13.909126
Median Absolute Deviation (MAD): 0.4
Skewness: 5.6253182
Sum: 76361.1
Variance: 1348.1816
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value               Count  Frequency (%)
0                   2548   46.4%
0.3                 63     1.1%
0.2                 62     1.1%
0.5                 59     1.1%
1                   48     0.9%
0.1                 46     0.8%
0.6                 44     0.8%
1.2                 42     0.8%
2                   41     0.7%
1.5                 39     0.7%
Other values (763)  2498   45.5%

Minimum 10 values

Value  Count  Frequency (%)
0      2548   46.4%
0.1    46     0.8%
0.2    62     1.1%
0.3    63     1.1%
0.4    36     0.7%
0.5    59     1.1%
0.6    44     0.8%
0.7    35     0.6%
0.8    27     0.5%
0.9    14     0.3%

Maximum 10 values

Value  Count  Frequency (%)
686.6  1      < 0.1%
437.8  1      < 0.1%
405.5  1      < 0.1%
397.6  1      < 0.1%
379.7  1      < 0.1%
371.3  1      < 0.1%
361.7  1      < 0.1%
338    1      < 0.1%
330.3  1      < 0.1%
318.6  1      < 0.1%

Lượng mưa DH
Real number (ℝ)

HIGH CORRELATION
ZEROS

Distinct: 732
Distinct (%): 13.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 12.15518
Minimum: 0
Maximum: 746.9
Zeros: 2493
Zeros (%): 45.4%
Negative: 0
Negative (%): 0.0%
Memory size: 43.0 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0.2
Q3: 7.6
95-th percentile: 63.155
Maximum: 746.9
Range: 746.9
Interquartile range (IQR): 7.6

Descriptive statistics

Standard deviation: 35.282837
Coefficient of variation (CV): 2.9026996
Kurtosis: 70.716706
Mean: 12.15518
Median Absolute Deviation (MAD): 0.2
Skewness: 6.5966994
Sum: 66731.94
Variance: 1244.8786
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value               Count  Frequency (%)
0                   2493   45.4%
0.1                 154    2.8%
0.2                 118    2.1%
0.3                 77     1.4%
0.4                 71     1.3%
0.5                 63     1.1%
0.7                 47     0.9%
1                   41     0.7%
0.6                 40     0.7%
0.8                 37     0.7%
Other values (722)  2349   42.8%

Minimum 10 values

Value  Count  Frequency (%)
0      2493   45.4%
0.1    154    2.8%
0.2    118    2.1%
0.3    77     1.4%
0.32   1      < 0.1%
0.4    71     1.3%
0.5    63     1.1%
0.6    40     0.7%
0.7    47     0.9%
0.8    37     0.7%

Maximum 10 values

Value  Count  Frequency (%)
746.9  1      < 0.1%
554.6  1      < 0.1%
414.6  1      < 0.1%
342.5  1      < 0.1%
341.9  1      < 0.1%
338.2  1      < 0.1%
330.5  1      < 0.1%
329    1      < 0.1%
320.4  1      < 0.1%
315.7  1      < 0.1%

Interactions

Correlations

Auto

The auto setting picks an interpretable pairwise metric for each pair of columns according to their variable types:
  • Variable_type-Variable_type : Method, Range
  • Categorical-Categorical : Cramér's V, [0,1]
  • Numerical-Categorical : Cramér's V, [0,1] (using a discretized numerical column)
  • Numerical-Numerical : Spearman's ρ, [-1,1]
The number of bins used in the discretization for the Numerical-Categorical column pair can be changed using config.correlations["auto"].n_bins. The number of bins affects the granularity of the association you wish to measure.

This configuration uses the recommended metric for each pair of columns.
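
As a rough illustration of the Numerical-Categorical case above, the sketch below discretizes a numeric column and computes Cramér's V against a categorical one. It is a minimal standard-library version, not the pandas-profiling implementation: the function names, the equal-width binning, the absence of any bias correction, and the toy values are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def equal_width_bins(values, n_bins):
    """Map each number to a bin index 0..n_bins-1 (equal-width bins, an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def cramers_v(a, b):
    """Uncorrected Cramér's V between two equally long sequences of labels."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    chi2 = 0.0
    for label_a, n_a in count_a.items():
        for label_b, n_b in count_b.items():
            expected = n_a * n_b / n
            observed = joint.get((label_a, label_b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical toy data (not taken from the dataset).
month = ["9", "9", "10", "10", "11", "11", "12", "12"]
level = [0.1, 0.2, 0.5, 0.6, 0.9, 1.0, 1.4, 1.5]

v = cramers_v(month, equal_width_bins(level, n_bins=4))
```

In this toy case every month falls into its own bin, so the association is perfect. Fewer bins would merge months together and lower the value, which is the granularity effect that n_bins controls.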

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
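
The calculation just described can be sketched in plain Python: rank both series, then divide the covariance of the ranks by the product of their standard deviations. The function names and sample values are hypothetical, and giving tied values the average of their positions is an assumption about tie handling.

```python
from math import sqrt

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` to cover the whole run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Covariance of the ranks divided by the product of their standard deviations."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(rank_x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    # The 1/(n-1) factors of covariance and variance cancel, so they are omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: a nonlinear but strictly increasing relationship.
rainfall = [0.0, 0.2, 1.5, 4.0, 9.5]
response = [v ** 3 for v in rainfall]
rho = spearman_rho(rainfall, response)
```

Because the cubic relationship preserves order, both series have identical ranks and ρ reaches its maximum even though the relationship is far from linear.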

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
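
A minimal sketch of that formula, with an informal check of the location and scale invariance mentioned above. The function name and the toy values are hypothetical and not taken from the dataset.

```python
from math import sqrt

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/(n-1) factors of covariance and variance cancel, so they are omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values.
level = [5.7, 5.9, 6.1, 6.4, 7.0]
rain = [0.0, 0.2, 1.5, 4.0, 9.5]

r = pearson_r(level, rain)
# Shifting and positively rescaling either variable should leave r unchanged.
r_rescaled = pearson_r([100 * v - 3 for v in level], [0.5 * v + 10 for v in rain])
```

The two results agree up to floating-point rounding, which is the invariance property in action. Rescaling by a negative factor would flip the sign of r.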

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
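
The pair-counting definition can be sketched directly. This is the simple variant usually called tau-a, written here as an assumption about what the text describes; statistical libraries commonly report a tie-corrected variant (tau-b), which can differ when the data contain ties. The function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both variables move in the same direction
        elif direction < 0:
            discordant += 1  # the variables move in opposite directions
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical toy data: one swapped pair out of three.
tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
```

With three observations there are three pairs; two are concordant and one is discordant, so τ is one third. The quadratic loop is fine for a sketch but slow for thousands of rows.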

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly spot patterns in data completion.

Sample

First rows

      Year  Month  Day  Mực nước KG  Mực nước LT  Mực nước DH  Lượng mưa KG  Lượng mưa LT  Lượng mưa DH
0     1976  9      1    5.79         -0.10        0.02         0.0           0.0           0.0
1     1976  9      2    5.75         -0.11        0.00         0.0           0.0           0.0
2     1976  9      3    5.73         -0.12        0.00         0.0           0.0           0.0
3     1976  9      4    5.74         -0.14        -0.02        0.0           0.0           0.0
4     1976  9      5    5.74         -0.14        -0.01        0.0           12.0          0.0
5     1976  9      6    5.75         -0.14        -0.01        22.4          0.0           0.1
6     1976  9      7    5.76         -0.12        0.01         0.0           0.0           0.0
7     1976  9      8    5.73         -0.13        0.00         0.0           0.0           0.0
8     1976  9      9    5.71         -0.14        0.02         0.0           0.0           0.0
9     1976  9      10   5.68         -0.15        0.06         0.0           0.0           0.0

Last rows

      Year  Month  Day  Mực nước KG  Mực nước LT  Mực nước DH  Lượng mưa KG  Lượng mưa LT  Lượng mưa DH
5480  2020  12     22   6.51         0.59         0.40         0.0           0.0           0.0
5481  2020  12     23   6.46         0.54         0.32         0.0           0.0           0.0
5482  2020  12     24   6.42         0.47         0.20         1.6           0.4           0.2
5483  2020  12     25   6.39         0.43         0.11         0.2           0.0           0.0
5484  2020  12     26   6.38         0.37         0.08         2.4           0.8           0.0
5485  2020  12     27   6.36         0.32         0.13         0.0           0.0           0.0
5486  2020  12     28   6.33         0.27         0.03         0.4           0.0           0.1
5487  2020  12     29   6.30         0.21         0.06         0.0           0.0           0.0
5488  2020  12     30   6.28         0.14         0.27         7.6           0.4           2.4
5489  2020  12     31   6.34         0.17         0.40         4.8           0.0           0.1